Skip to content

Conversation

@tmater
Copy link
Contributor

@tmater tmater commented Jan 7, 2026

Implements VariantType extension type with VariantVector for storing variant data with metadata and value buffers. Includes reader/writer implementations and comprehensive test coverage.

What's Changed

  • Core Types: VariantType, VariantVector, and holder classes
  • Reader/Writer Support: Reader/writer implementations

Closes #GH-946.

@github-actions

This comment has been minimized.

@lidavidm
Copy link
Member

lidavidm commented Jan 7, 2026

We're about to overhaul the API for implementing extension writers so you may want to hold off

@tmater
Copy link
Contributor Author

tmater commented Jan 7, 2026

We're about to overhaul the API for implementing extension writers so you may want to hold off

Thanks @lidavidm, I will wait for that, then rebase. You are talking about the ExtensionTypeWriterFactory related changes, right?

Implements VariantType extension type with VariantVector for storing
variant data with metadata and value buffers. Includes reader/writer
implementations and comprehensive test coverage.
Comment on lines +148 to +151
@Override
public Object getObject(int index) {
return getUnderlyingVector().getObject(index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is useful this way. I don't have a strong opinion though what should it return instead. @lidavidm, do you have an idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands, this returns the raw struct row? I suppose decoding to a Java object may be what's expected but that may be expensive. Or returning an explicit Variant object that has methods for inspecting/manipulating the data in that row

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the variant representation of parquet-java would be suitable here. That is a java object backed by nio ByteBuffers. It would be beneficial as arrow-java would not need to maintain/test the related code. Meanwhile, it does not support returning "rich" objects (e.g. timestamps) but the primitive representations only (e.g. long for a timestamp). I'm not sure how it would fit the current concept.
Alternatively, we would be able to wrap the parquet-java Variant object and extend it with those rich types the different Arrow vectors support (e.g. LocalDateTime for a timestamp nano value).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we go this route, I'd rather have an Arrow Variant to avoid having to break APIs down the line if we decide to extend functionality

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure I understand you correctly, @lidavidm, do you agree with the following?

Add the new parquet-java dependency and use its variant implementation as a bases of a new Arrow Variant class to be returned by VariantVector.getObject. This class can return the corresponding java objects for the related primitives according to the java objects returned by their typed vectors. No parquet-java classes shall face the public API of Arrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah, arrow-parquet-variant might be OK to isolate the dependency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with arrow-parquet-variant approach. It cleanly isolated the dependency (to avoid any side effect in other parts of Arrow).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've misinterpreted @jbonofre's suggestion. I wanted to reference the parquet-java's parquet-variant module. That contains only 10 prod classes. But right, it has a dependency on other parquet-java modules as well. So, we have two options:

  • Create the separate arrow-variant module that contains any variant related classes, including the ones already included in this PR.
  • Duplicate the variant related code from parquet-java, and implement our own classes based on it in the existing vector module.

I do not have a strong opinion which one would be the best choice for Arrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my concern about parquet-variant: the "parent" and transitive dependencies. I don't think we should fetch a large part of Parquet in Arrow.

I have a preference to create arrow-variant module with clearly defined dependencies.

Thoughts ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. @tmater, could you update your PR accordingly?

@lidavidm lidavidm added the enhancement PRs that add or improve features. label Jan 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement PRs that add or improve features.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants